Introduction
In this project we use a dataset of wine reviews to predict review points from numerical, categorical and textual predictors.
The data is from Kaggle Datasets, and covers 150k wine reviews along with some attributes of the wines. It can be found here. (A (free) Kaggle login is required to access it directly from kaggle.com). The data was originally scraped from WineEnthusiast.
The dataset contains the following columns:
Points: the number of points the wine received in its review, on a scale of 1-100. However, only wines with >= 80 points were published in the dataset. We plan to transform this feature and check out logit, probit, BoxCox and log transforms. This will be our response variable.
Description: a description of the wine’s taste, smell, look, feel, etc. We plan to use LDA topic modeling to convert the text description into a small vector of numerical predictors. We plan to generate a “sentiment” numeric predictor using a sentiment classifer. This is an additive predictor.
Price: the cost for a bottle of the wine. We will transform this using log or inverse or a power of the inverse (TBD by CV), or any combination of these.
Variety: the type of grapes used
Country: the country that the wine is from
This is a particularly interesting problem for several reasons:
- We need to transform both predictors and response, which makes for a more interesting problem. We can dig into research of optimal transforms using cross validation over a grid of models.
- The regression via glm shows heavy tails on the errors in the QQ plot. Because we have 137k data points, this offers us an opportunity to argue that by the CLT the heavy tails do not obviate our use of t and F tests. We can explore the pitfalls of heavy tails and try to use the “heavy” package to do a glm with heavy tails. Again, this is more interesting than a more straight-forward problem.
- Chance to use LDA topic modeling and sentiment analysis
Methods
Feature Engineering
- The country was used to compute the continentTopFive, which was added as a categorical feature. This feature is the continent of the country of origin of the wine, with the separation of the top 5 wine producers in Europe into their own class. These include France, Germany, Italy, Spain and Portugal. The levels are indicated in the plot of points vs log(price), by continentTopFive.

- LDA for 2 topics was performed on the description of the wine. The probability of being in the first topic was added as a feature named “topic1”.
The sentiment of the description of the wine was computed using bing. Sentiment was added as a numeric predictor, named “sentiment”, ranging from -6 to 11.
- The resulting predictors after feature enginering are shown below, with ‘points’ as the response.
Because price ranges from 4 to 2300 USD, and is positive, ranging over 3 orders of magnitude, log(price) is used as a predictor, instead of price.
## [1] "points" "price" "continentTopFive"
## [4] "topic1" "sentiment"
Training and Test Set Generation
We create an 80% training, 20% test split for later use in cross validation.
Model Selection
Model #1
- The model started with the base of: \(log(points-79.999) \sim log(price)*continentTopFive*sentiment*topic1 + I(log(price))^2 + I(log(price))^3\)
- The optimal model after backward BIC selection was: \(log(points - 79.999) \sim log(price) + continentTopFive + topic1 + log(price):continentTopFive + continentTopFive:topic1\)




##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 2824.7, df = 20, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod1), 5000)
## W = 0.91738, p-value < 2.2e-16
- Outliers, in which the residual had magnitude over 7 points, were removed. Highly influential points were removed and the model was refit. 1283 observations were removed in total.
- 10-fold cross-validation yielded a mean test RMSE of 2.5611389.
Model 2
- The model started with the base of: \(points \sim log(price)*continentTopFive*sentiment*topic1 + I(log(price))^2 + I(log(price))^3\)
- The optimal model after backward BIC selection was: \(points \sim log(price) + continentTopFive + topic1 + log(price):continentTopFive + log(price):topic1 + continentTopFive:topic1\)




##
## studentized Breusch-Pagan test
##
## data: mod2
## BP = 5604, df = 21, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod2), 5000)
## W = 0.99597, p-value = 1.765e-10
- Highly influential points were removed and the model was refit. 3422 observations were removed in total.
- 10-fold cross-validation yielded a mean test RMSE of 2.4331344.
Model 3
- The Box-Cox transform of the response was performed to try to counteract the indicated heteroskedasticity.
- The model started with a base of: \(bc\_points \sim log(price)*continentTopFive*sentiment*topic1 + I(log(price))^2 + I(log(price))^3\)
- The optimal model after backward BIC selection was: \(bc\_points \sim log(price) + continentTopFive + topic1 + log(price):continentTopFive + log(price):topic1 + continentTopFive:topic1\)




##
## studentized Breusch-Pagan test
##
## data: mod3
## BP = 4521.8, df = 21, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod3), 5000)
## W = 0.99349, p-value = 2.522e-14
- Highly influential points were removed and the model was refit. 0 observations were removed in total.
- 10-fold cross-validation yielded a mean test RMSE of 2.4554481.
Model 4
- Weighted least squares was performed, with weight being proportional to the inverse of the observed variance of points. The observed variance of residuals peaks at points = 90 and tapers to the ends of the points range. The weight function therefore assumes its minimum at points = 90 and increases at the ends of the range of points.
- The weight function used: \(weights = sqrt(1 / 1 + abs(points -90))\)
- The model started with a base of: \(points \sim log(price)*continentTopFive*sentiment*topic1 + I(log(price))^2 + I(log(price))^3\)
- The optimal model after backward BIC selection was: \(points \sim log(price) + continentTopFive + topic1 + log(price):continentTopFive + log(price):topic1 + continentTopFive:topic1\)




##
## studentized Breusch-Pagan test
##
## data: mod4
## BP = 4464.5, df = 21, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod4), 5000)
## W = 0.99624, p-value = 5.742e-10
- Highly influential points were removed and the model was refit. 3117 observations were removed in total.
- 10-fold cross-validation yielded a mean test RMSE of 2.5705648.
Results
- 4 models were analyzed as candidates for optimal prediction of points.
- For each, a base model was constructed and backward BIC selection was used to trim the model.
- As seen by the plots and the BP and Shapiro-Wilk tests, normality of the errors is suspect, and heteroskedasticity is indicated for all models. In particular, the variance seems highest for points around 90, and tapers to the extremes. Note however, that since Shapiro-Wilk in R can take a maximum of 5000 residuals, sampling was required, and this test is then dependent on the choice of random seed used in the sampling.
- Many transforms (the 4 here, and not shown) were attemped in order to overcome these violations of the assumptions of multiple linear regression by least squares minimization.
- 10 fold cross-validation was performed on all models, with mean CV RMSE given as below.
| 1 |
2.561 |
| 2 |
2.433 |
| 3 |
2.455 |
| 4 |
2.571 |
- Model #2 wins by minimal CV RMSE.
- The CV RMSE can be used to construct prediction intervals, in lieu of analytic formulae based on normal and homoskedastic errors, as \(\hat{y} \pm 2*2.433\) for the prediction interval.
preds = predict(mod2, newdata=wine_test)
success_ratio = sum(wine_test$points >= preds - 2*2.433 & wine_test$points <= preds + 2*2.433) / nrow(wine_test)
- Using Model 2 with this prediction interval resulted in a success ratio of 0.9434985 on the withheld test set (i.e. this percentage of the withheld test set points were within the prediction interval of Model #2).
Discussion
TODO: Instructions: “The discussion section should contain discussion of your results. This should frame your results in the context of the data. How is your final model useful?”
Appendix
About the LDA
- Since we did not want to wander too far into topic modelling of the wine descriptions, an LDA with only 2 topics was performed. Future extension to this work should try to mine topics and sentiment further from the somelier description, via NLP techniques.
- Our LDA treatment relies on 2 references regarding the ‘topicmodels’ library of R:
- LDA with 2 topics creates 2 probability distributions of words discovered in the processed wine descriptions (after removing punctuation, white space, casting to lower case and stemming, which means taking word roots). Each of these probability distributions is a “topic”. Each description receives a probability of coming from topic1, with the probability of coming from topic2 being 1 minus this value. This probability of being generated from topic 1 is our “topic1” predictor. Our “topic1” predictor remained significant as an additive and integrated predictor in all BIC reduced models.
- The “beta” of a word in a topic is its probability within the topic probability distribution. Below, topic 1 and topic 2 probabilities for various words are displayed, as is the log ratio of the topic probabilities for some words.

